Abstract: Necessary conditions for asymptotically optimal 
sliding-block or stationary codes for source coding and rate- 
constrained simulation of memoryless sources are presented and 
used to motivate a design technique for trellis-encoded source 
coding and rate-constrained simulation. The code structure has 
intuitive similarities to classic random coding arguments as 
well as to "fake process" methods and alphabet-constrained 
methods. Experimental evidence shows that the approach pro- 
vides comparable or superior performance in comparison with 
previously published methods on common examples, sometimes 
by significant margins. 

Index Terms — Source coding, simulation, rate-distortion, trellis 
source encoding 

I. Introduction 

THE basic goal of Shannon source coding with a fidelity criterion, or lossy data compression, is to convert
an information source $\{X_n\}$ into bits which can be decoded
into a good reproduction of the original source, ideally the 
best possible reproduction with respect to a fidelity criterion 
given a constraint on the rate of transmitted bits. Memoryless 
discrete-time sources have long been a standard benchmark for 
testing source coding or data compression systems. Although 
of limited interest as a model for real world signals, inde- 
pendent identically distributed (IID) sources provide useful 
comparisons among different coding methods and designs. 
In addition, specific examples such as Gaussian and uniform 
sources can provide intuitive interpretations of how coding 
schemes yield good performance and they can serve as build- 
ing blocks for more complicated processes such as linear 
models driven by IID processes. 

A separate, but intimately related, topic is that of rate- 
constrained simulation — given a "target" random process 
such as an IID Gaussian process, what is the best possible 
imitation of the process that can be generated by coding 
a simple discrete IID process with a given (finite) entropy 
rate? Here "best" can be quantified by a metric on random processes such as the generalized Ornstein $\bar d$ distance (or Monge-Kantorovich transportation distance/Wasserstein distance extended to random processes). For example, what is the best imitation Gaussian process with only one bit per symbol?

Intuitively and mathematically [12], [14], if the source code is working well, one would expect the channel bits produced by the source encoder to be approximately IID and the resulting reproduction process to be as close to the source as possible with a one bit per symbol channel. Thus the decoder driven by coin flips should produce a nearly optimal simulation. Conversely, if an IID source driving a stationary code produces a good simulation of a source, the code should provide a good decoder in a source coding system with an encoder matching possible decoder outputs to the source sequence, e.g., a Viterbi algorithm.

Rigorous results along this line were developed in [10], showing that the two optimization problems are equivalent and that optimal (or nearly optimal) source coders imply optimal (or nearly optimal) simulators and vice versa for the specific case of stationary codes and sources that are $B$-processes (stationary codings of IID processes).

Results that are similar in spirit were developed for more general sources by Steinberg and Verdú [31], where other deep connections between process simulation and rate-distortion theory were also explored. However, the results in [31] are for asymptotically long block codes, while our focus is on stationary codes (especially on stationary decoders of modest memory) and on the behavior of processes rather than on the asymptotics of finite-dimensional distributions, which might not correspond to the joint distributions of a stationary process.

We introduce a design technique for trellis-encoded source coding based on designing a stationary decoder to approximately satisfy necessary conditions for optimality (analogous to the Lloyd algorithm for vector quantizer design [8]) and using a matched Viterbi algorithm as an encoder (analogous to the minimum distortion encoder in the Lloyd algorithm). The combination of a good decoder with a matched search algorithm as the encoder is the most common implementation of trellis source codes. Previous work [35], [23], [32] for trellis encoding system design has been based largely on intuitive guidelines, assumptions, or formal axioms for good code design. In contrast, we prove several necessary conditions which optimal or asymptotically optimal source codes must satisfy, including some properties simply assumed in the past. Examples of such properties are Pearlman's observations [23] that the marginal reproduction distribution should approximate the Shannon optimal reproduction and that the reproduction process should be approximately white. We give a code construction which provably satisfies a key necessary condition and which is shown experimentally to satisfy the other necessary conditions while providing performance comparable to or superior to previously published work, and in many cases, remarkably close to the theoretical limit.

The rest of the paper is organized as follows. In Section II we give an overview of definitions and concepts we need for stating our results and in Section III we state and prove the necessary conditions for optimum trellis-encoded source code design. Section IV introduces the new design technique and Section V presents experimental results for encoding memoryless Gaussian, uniform, and Laplacian sources.

II. Preliminaries 

A. A note on notation 

We deal with random objects which will be denoted by capital letters. These include random variables $X_n$, $N$-dimensional random vectors $X^N = (X_0, X_1, \ldots, X_{N-1})$, and random processes $\{X_n; n \in \mathbb{Z}\}$, where $\mathbb{Z}$ is the set of all integers. The generic notation $X$ might stand for any of these random objects, where the specific nature will either be clear from context or stated (this is to avoid notational clutter when possible). Lower case letters will correspond to sample values of random objects. For example, given an alphabet $A$ (such as the real line $\mathbb{R}$ or the binary alphabet $\{0, 1\}$), a random variable $X_n$ may take on values $x_n \in A$, an $N$-dimensional random vector $X^N$ may take on values $x^N \in A^N$, the Cartesian product space, and a random process $\{X_n; n \in \mathbb{Z}\}$ may take on values $\{x_n; n \in \mathbb{Z}\} = (\ldots, x_{-1}, x_0, x_1, \ldots) \in A^\infty$. A lower case letter without subscript or superscript may stand for a member of any of these spaces, depending on context.



B. Stationary and sliding-block codes 

A stationary or sliding-block code is a time-invariant filter, in general nonlinear. It operates on an input sequence to produce an output sequence in such a way that shifting the input sequence results in a shifted output sequence. More precisely, a stationary code $\bar f$ with an input alphabet $A$ (typically $\mathbb{R}$ or a Borel subset for an encoder or $\{0, 1\}$ for a decoder) and output alphabet $B$ (typically $\{0, 1\}$ for an encoder or some subset of $\mathbb{R}$ for a decoder) is a measurable mapping (with respect to suitable $\sigma$-fields) of an infinite input sequence (in $A^\infty$) into an infinite output sequence (in $B^\infty$) with the property that $\bar f(T_A x) = T_B \bar f(x)$, where $T_A$ is the (left) shift on $A^\infty$, that is, $T_A(\ldots, x_{-1}, x_0, x_1, \ldots) = (\ldots, x_0, x_1, x_2, \ldots)$. The sequence-to-sequence mapping $\bar f : A^\infty \to B^\infty$ is described by the sequence-to-symbol mapping defined by the code output at time 0, $f(x) = \bar f(x)_0$, since $\bar f(x)_n = \bar f(T_A^n x)_0 = f(T_A^n x)$. More concretely, the sequence-to-symbol mapping $f$ usually depends on only a finite window of the data, in which case the output random process, say $\{Y_n\}$, can be expressed as

$$Y_n = f(X_{n-N_1}, \ldots, X_n, \ldots, X_{n+N_2}),$$

a mapping on the contents of a shift register containing $L = N_1 + N_2 + 1$ samples of the input random process $\{X_n\}$. Both $f$ and $\bar f$ will be referred to as stationary or sliding-block codes.
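
The finite-window form of a sliding-block code can be illustrated with a minimal Python sketch. The helper names `sliding_block_code` and `sign_of_average`, the window map, and the sample values are illustrative assumptions, not part of the paper; the sketch only shows that the same window map applied at every time index yields a shift-invariant code on a finite record.

```python
from typing import Callable, List, Sequence


def sliding_block_code(x: Sequence[float],
                       window_map: Callable[[Sequence[float]], float],
                       n1: int, n2: int) -> List[float]:
    """Apply a finite-window sliding-block code to a finite input record.

    The output at time n is window_map(x[n-n1], ..., x[n], ..., x[n+n2]).
    Only times whose full window lies inside the record are produced, so
    the returned list corresponds to input times n1, ..., len(x)-n2-1.
    """
    length = n1 + n2 + 1  # shift-register length L = N1 + N2 + 1
    return [window_map(x[i:i + length]) for i in range(len(x) - length + 1)]


def sign_of_average(window: Sequence[float]) -> float:
    """Hypothetical nonlinear window map: the sign of the window sum."""
    return 1.0 if sum(window) >= 0 else -1.0


x = [0.3, -1.2, 0.8, 0.1, -0.4, 2.0, -0.7, 0.5]
y = sliding_block_code(x, sign_of_average, n1=1, n2=1)

# Time invariance: coding the shifted input gives the shifted output.
y_shifted = sliding_block_code(x[1:], sign_of_average, n1=1, n2=1)
assert y_shifted == y[1:]
```

On a finite record the outputs near the edges are simply not produced; the two-sided infinite sequences of the definition have no such edge effects.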

Unlike block codes, stationary codes preserve statistical characteristics of the coded process, including stationarity, ergodicity, and mixing. If a stationary and ergodic source $\{X_n\}$ is encoded into bits by a stationary code $f$, which are in turn decoded into a reproduction process $\{\hat X_n\}$ by another stationary code $g$, then the resulting pair process $\{X_n, \hat X_n\}$ and output process $\{\hat X_n\}$ are also stationary and ergodic.

Given any block code, a stationary code with similar properties can be constructed (at least in theory) and vice versa. Thus good codes of one type can be used to construct good codes of the other (at least in theory) and the optimal performance for the two classes of codes is the same [28], [16], [13].

C. Fidelity and distortion 

A distortion measure $d(x, y)$, $x \in A$, $y \in \hat A$, is a nonnegative measurable function (with respect to suitable $\sigma$-fields). A fidelity criterion is a family of distortion measures $d_N(x^N, y^N)$, $x^N \in A^N$, $y^N \in \hat A^N$, $N = 1, 2, \ldots$. We assume that the fidelity criterion is additive (or single-letter):

$$d_N(x^N, y^N) = \sum_{i=0}^{N-1} d(x_i, y_i),$$

where $d = d_1$. Throughout the paper, we make the standard assumption that $\hat A \subset A$ and $d(x, x) = 0$. Given random vectors $X^N, Y^N$ with a joint distribution $\pi^N$, the average distortion is defined by the expectation $d(\pi^N) = E[d_N(X^N, Y^N)]$.

Given a stationary pair process $\{X_n, Y_n\}$, the average distortion between $N$-tuples is given by the single-letter characterization $N^{-1} E[d_N(X^N, Y^N)] = E[d(X_0, Y_0)] = d(\pi^1)$, and hence a measure of the fidelity (or, rather, lack of fidelity) of a stationary coding and decoding of a stationary source $X_n$ into a reproduction $\hat X_n$ is the average distortion $D(f, g) = E[d(X_0, \hat X_0)]$. The emphasis in this paper will be the case where $A = \mathbb{R}$ and the distortion is the common squared error distortion, $d(x, y) = (x - y)^2$. Also of interest is the Hamming distortion, where $d(x, y) = 0$ if $x = y$ and 1 otherwise.

Throughout the paper we assume that the stationary process $\{X_n\}$ and the distortion measure $d$ satisfy the following standard reference letter condition: there exists $x \in \hat A$ such that $E[d(X_0, x)] < \infty$. In particular, when the distortion is the squared error, we always assume that the source has finite variance.

D. Optimal source coding 

Let $C(A, B)$ denote the collection of all sliding-block codes with input alphabet $A$ and finite output alphabet $B$ of size $\|B\|$. The operational distortion-rate function for source $X$ is defined by

$$\delta_X(R) = \inf_{f \in C(A,B),\ g \in C(B,\hat A):\ \|B\| \le 2^R} D(f, g).$$

Note that $\delta_X(R)$ is defined for the discrete set of $R$ values such that $R = \log k$ for some positive integer $k$.

E. Distance measures for random vectors and processes 

A distortion measure $d$ induces a natural notion of a "distance" between random vectors and processes (the quotes will be removed when the relation to a true distance or metric is clarified). The optimal transportation cost between two probability distributions, say $\mu_X$ and $\mu_Y$, corresponding to random variables (or vectors) defined on a common (Borel) probability space $(A, \mathcal{B}(A))$ with a nonnegative cost function $d$ is defined as

$$T(\mu_X, \mu_Y) = \inf_{\pi \in \mathcal{P}(\mu_X, \mu_Y)} E_\pi d(X, Y),$$

where $\mathcal{P}(\mu_X, \mu_Y)$ is the class of all probability distributions on $(A, \mathcal{B}(A))^2$ having $\mu_X$ and $\mu_Y$ as marginals, that is, $\pi(F \times A) = \mu_X(F)$, $\pi(A \times F) = \mu_Y(F)$ for all $F \in \mathcal{B}(A)$. The reader is referred to Villani [33] and Rachev and Rüschendorf [25] for extensive development and references. The most important special case is when the cost function is a nonnegative power of an underlying metric: $d(x, y) = m(x, y)^r$, where $A$ is a complete, separable metric (Polish) space with respect to $m$. In this case $T(\mu_X, \mu_Y)^{\min(1, 1/r)}$ is a metric. The notation $T_2$ and $T_0$ will be used to denote the two most important cases of the optimal transportation cost with respect to the squared error and Hamming distance, respectively.
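
For intuition about $T_2$, the following Python sketch evaluates the squared-error transportation cost between two empirical distributions on the real line. It relies on the standard fact that, for a convex cost of the difference on the real line, the monotone (sorted) coupling is optimal. The function name `empirical_t2` and the restriction to equally weighted samples of equal size are assumptions made only for this illustration.

```python
def empirical_t2(xs, ys):
    """Squared-error optimal transportation cost between two empirical
    distributions on the real line, each placing equal mass on its points.

    Assumes both samples have the same number of points, so the optimal
    coupling pairs the i-th smallest x with the i-th smallest y.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))  # monotone (quantile) coupling
    return sum((a - b) ** 2 for a, b in pairs) / len(xs)


# The same two-point distribution listed in a different order costs nothing:
same = empirical_t2([0.0, 3.0], [3.0, 0.0])

# Moving mass from {0, 2} onto a point mass at 1 costs 1 per unit of mass:
spread = empirical_t2([0.0, 2.0], [1.0, 1.0])
```

The first call shows that the cost depends only on the two distributions, not on any particular pairing of samples: the naive index-by-index pairing would report a squared error of 9.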

Given two processes with process distributions $\mu_X$ and $\mu_Y$ on $(A^\infty, \mathcal{B}(A^\infty))$, let $\mu_{X^N}$ and $\mu_{Y^N}$ denote the induced $N$-dimensional distributions for all positive integers $N$. Let $d_N$ be an additive distortion measure induced by $d(x, y)$, $x, y \in A$. Define the (generalized) $\bar d$ distance [15] between two stationary processes

$$\bar d(\mu_X, \mu_Y) = \sup_N N^{-1} T(\mu_{X^N}, \mu_{Y^N}).$$

If $d$ is a metric, then so is $\bar d$. If $d$ is the Hamming metric, this is Ornstein's d-bar distance [21], [22]. If $d$ is a power of an underlying metric, then $\bar d(\mu_X, \mu_Y)^{\min(1, 1/r)}$ will also be a metric. We will refer to $\bar d$ as the "$\bar d$-distance" whether or not it is actually a true metric. We distinguish the most important cases by subscripts, in particular $\bar d_2$ denotes $\bar d$ with $d$ squared error (and hence $\sqrt{\bar d_2}$ is a metric) and $\bar d_0$ denotes $\bar d$ with $d$ equal to the Hamming distance ($\bar d_0$ is a metric).

For stationary processes there is a simpler characterization of $\bar d$:

$$\bar d(\mu_X, \mu_Y) = \inf_\pi E_\pi[d(X_0, Y_0)] \tag{1}$$

where the infimum is over all stationary pair processes $\pi$ with marginal processes $\mu_X$ and $\mu_Y$ (or stationary and ergodic pair processes if $\mu_X$ and $\mu_Y$ are ergodic). This and many other properties of the $\bar d$ and generalized $\bar d$ distances are detailed in [22], [15], [13]. Properties relevant here include the following:

1) For stationary processes,

$$\bar d(\mu_X, \mu_Y) = \lim_{N \to \infty} N^{-1} T(\mu_{X^N}, \mu_{Y^N}). \tag{2}$$

2) If the processes are both IID, then

$$\bar d(\mu_X, \mu_Y) = T(\mu_{X_0}, \mu_{Y_0}). \tag{3}$$

3) If the processes are both stationary and ergodic, the distance can be expressed as the infimum over the limiting distortion between any two frequency-typical sequences of the two processes. Thus the $\bar d$-distance between the two processes is the amount by which a frequency-typical sequence of one process must be changed in a time average $d$ sense to produce a frequency-typical sequence of another process.
The $\bar d$ process distance can be used to characterize both the optimal source coding and the optimal rate-constrained simulation problem. Let $\{X_n\}$ be a random process described by a process distribution $\mu_X$ and let $\{Z_n\}$ be an IID equiprobable random process with alphabet $B$ of size $\|B\| = 2^R$ and distribution $\mu_Z$. The optimal simulation of the process $X = \{X_n\}$ with process distribution $\mu_X$ given the process $Z = \{Z_n\}$ with process distribution $\mu_Z$ and reproduction alphabet $\hat A$ is characterized by

$$\Delta_{X|Z}(R) = \inf_{f \in C(B, \hat A)} \bar d(\mu_X, \mu_{\bar f(Z)}) \tag{4}$$

where $\mu_{\bar f(Z)} = \mu_Z \bar f^{-1}$ is the process distribution resulting from a stationary coding of $Z$ using $f$, i.e., for all events $F$, $\mu_{\bar f(Z)}(F) = \mu_Z(\bar f^{-1}(F))$. The notation for $\Delta_{X|Z}(R)$ is redundant since $R$ determines the distribution of $Z$ and vice versa. As in the definition of the operational distortion-rate function, $R$ is of the form $R = \log k$ for some positive integer $k$.

F. Entropy rate 

Alternative characterizations of the optimal source coding and simulation performance can be stated in terms of the entropy rate of a random process. As we will be dealing with both discrete and continuous alphabet processes and with some borderline processes that have continuous alphabets yet finite entropy, suitably general notions of entropy as found in mathematical information theory and ergodic theory are needed (see, e.g., [24]). For a finite-alphabet random process, define as usual the Shannon entropy of a random vector or, equivalently, of its distribution by $H(X^N) = H(\mu_{X^N}) = -\sum_{x^N} \mu_{X^N}(x^N) \log \mu_{X^N}(x^N)$ and the Shannon entropy rate of the process $X$ by $H(X) = H(\mu_X) = \inf_N N^{-1} H(X^N)$. If the process is stationary, then

$$H(X) = \lim_{N \to \infty} N^{-1} H(X^N). \tag{5}$$

In the general case of a continuous alphabet, the entropy rate is given by the Kolmogorov-Sinai invariant $H(X) = \sup_f H(\mu_{\bar f(X)})$, where the supremum is over all finite-alphabet stationary codes. It is important to note that (5) need not hold when the alphabet is not finite and that a random process with a continuous alphabet can have an infinite finite-order entropy and a finite entropy rate.
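
As a small finite-alphabet illustration of the limit in (5), the sketch below computes $N^{-1} H(X^N)$ by brute-force enumeration for a stationary two-state Markov chain. The chain, its transition matrix `P`, and the helper `block_entropy` are assumptions chosen only for this example; for such a chain the normalized block entropies are nonincreasing in $N$ and approach the conditional entropy $H(X_1 | X_0)$.

```python
import itertools
import math


def block_entropy(p0, P, N):
    """H(X^N) in bits for a Markov chain with initial pmf p0 and
    transition matrix P, computed by enumerating all length-N sequences."""
    h = 0.0
    for seq in itertools.product(range(len(p0)), repeat=N):
        prob = p0[seq[0]]
        for a, b in zip(seq, seq[1:]):
            prob *= P[a][b]
        if prob > 0:
            h -= prob * math.log2(prob)
    return h


# Hypothetical two-state chain; p0 is its stationary distribution.
P = [[0.9, 0.1], [0.2, 0.8]]
p0 = [2.0 / 3.0, 1.0 / 3.0]

# Normalized block entropies N^{-1} H(X^N) for N = 1, ..., 8.
rates = [block_entropy(p0, P, N) / N for N in range(1, 9)]

# Conditional entropy H(X_1 | X_0), the entropy rate of the chain.
cond = -sum(p0[a] * P[a][b] * math.log2(P[a][b])
            for a in range(2) for b in range(2))
```

Enumeration is exponential in $N$, so this is only practical for very small alphabets and block lengths.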

G. Constrained entropy rate optimization 

A stationary and ergodic process is called a $B$-process if it is obtained by a stationary coding of an IID process. If the source is stationary and ergodic, then [10]

$$\Delta_{X|Z}(R) = \inf_{B\text{-processes } \nu:\ H(\nu) \le R} \bar d(\mu_X, \nu), \tag{6}$$

that is, the best simulation by coding coin flips in a stationary manner has the same performance as the best simulation of $X$ by any $B$-process having entropy rate $R$ bits per symbol or less. If $X$ were itself discrete and a $B$-process with entropy rate less than or equal to $R$, then Ornstein's isomorphism theorem [22] (or the weaker Sinai-Ornstein theorem) implies that $\Delta_{X|Z}(R) = 0$. In words, a $B$-process can be stationarily encoded into any other $B$-process having equal or smaller entropy rate.

The $\bar d$-distance also yields a characterization of the operational distortion-rate function [16]:

$$\delta_X(R) = \inf_{\nu:\ H(\nu) \le R} \bar d(\mu_X, \nu), \tag{7}$$

where the infimum is over all stationary and ergodic processes. Comparing (6) and (7), obviously $\Delta_{X|Z}(R) \ge \delta_X(R)$. If the source $X$ is also a $B$-process, then the two infima are the same and $\Delta_{X|Z}(R) = \delta_X(R)$.

A related operational distortion-rate function resembling the simulation problem replaces the encoder/decoder with a common encoder output/decoder input alphabet by a single code into a reproduction having a constrained entropy rate. Suppose that a source $X$ is encoded by a sliding-block code $f$ directly into a reproduction $\hat X$ with process distribution $\mu_{\hat X} = \mu_{\bar f(X)}$. What coding yields the smallest distortion under the constraint that the output entropy rate is less than or equal to $R$? In this case, unsurprisingly,

$$\inf_{f:\ H(\mu_{\bar f(X)}) \le R} E[d(X_0, \hat X_0)] = \delta_X(R). \tag{8}$$

These relations implicitly define optimal codes and optimal performance, but they do not say how to evaluate the optimal performance or design the codes for a particular source. The Shannon rate-distortion function solves the first problem.

H. Shannon rate-distortion functions 

In the discrete alphabet case the $N$th order average mutual information between random vectors $X^N$ and $Y^N$ is given by $I(X^N, Y^N) = H(X^N) + H(Y^N) - H(X^N, Y^N)$. In general $I(X^N, Y^N)$ is given as the supremum of the discrete alphabet average mutual information over all possible discretizations or quantizations of $X^N$ and $Y^N$. If the joint distribution of $X^N$ and $Y^N$ is $\pi^N$, then we also write $I(\pi^N)$ for $I(X^N, Y^N)$.

The Shannon rate-distortion function [27] is defined for a stationary source $X$ by

$$R_X(D) = \inf_N N^{-1} R_{X^N}(D) = \lim_{N \to \infty} N^{-1} R_{X^N}(D),$$

$$R_{X^N}(D) = \inf_{\pi^N:\ \pi^N \in \mathcal{P}(\mu_{X^N}),\ N^{-1} d(\pi^N) \le D} I(\pi^N), \tag{9}$$

where $\mathcal{P}(\mu_{X^N})$ is the collection of all joint distributions $\pi^N$ for $X^N, Y^N$ with first marginal distribution $\mu_{X^N}$. The dual distortion-rate function is

$$D_X(R) = \inf_N N^{-1} D_{X^N}(R) = \lim_{N \to \infty} N^{-1} D_{X^N}(R),$$

$$D_{X^N}(R) = \inf_{\pi^N:\ \pi^N \in \mathcal{P}(\mu_{X^N}),\ N^{-1} I(\pi^N) \le R} d(\pi^N).$$

Source coding theorems show that under suitable conditions $\delta_X(R) = D_X(R)$. (See, e.g., [16] for source coding theorems for stationary codes.)



Csiszár [3] provided quite general versions of Gallager's Kuhn-Tucker optimization for evaluating the rate-distortion functions for finite dimensional vectors, in particular restating the optimization over joint distributions $\pi^N$ as an optimization over the reproduction distribution $\mu_{Y^N}$. When an optimizing reproduction distribution exists, it will be referred to as the Shannon optimal reproduction distribution. Csiszár provides conditions under which an optimizing distribution exists.

The following lemma and corollary are implied by the proof of Csiszár's Theorem 2.2 and the extension of the reproduction space from compact metric to Euclidean spaces discussed at the bottom of p. 66 of [3]. The lemma shows that if the distortion measure is a power of a metric derived from a norm, then there exists an optimizing joint distribution and hence also a Shannon optimal reproduction distribution. In the corollary, the roles of distortion and mutual information are interchanged to obtain the distortion-rate version of the result.

Lemma 1: Let $X$ be a random vector with an alphabet $A$ which is a finite-dimensional Euclidean space with norm $\|x\|$. Assume the reproduction alphabet $\hat A = A$ and a distortion measure $d(x, y) = \|x - y\|^r$, $r > 0$, such that $E[\|X\|^r] < \infty$. Then for any $D > 0$ there exists a distribution $\pi$ on $A \times A$ achieving the minimum of (9). Hence for any $N$, a Shannon $N$-dimensional optimal reproduction distribution exists for the $N$th order rate-distortion function.

Corollary 1: Given the assumptions of the lemma, suppose that $\pi^{(n)}$, $n = 1, 2, \ldots$ is a sequence of distributions on $A \times A$ with marginals $\mu_X$ and $\mu_{Y^{(n)}}$ for which for $n = 1, 2, \ldots$

$$I(\pi^{(n)}) = I(X, Y^{(n)}) \le R, \tag{10}$$

$$\lim_{n \to \infty} E[d(X, Y^{(n)})] = D_X(R). \tag{11}$$

Then $\mu_{Y^{(n)}}$ has a subsequence that converges weakly to a Shannon optimal reproduction distribution. If the Shannon distribution is unique, then $\mu_{Y^{(n)}}$ converges weakly to it.

I. IID sources

If the process $X$ is IID, then

$$R_X(D) = R_{X_0}(D) = \inf_{\pi:\ \pi \in \mathcal{P}(\mu_{X_0}),\ E d(X_0, Y_0) \le D} I(X_0, Y_0). \tag{12}$$

If a Shannon optimal distribution exists for the first-order rate-distortion function, then this guarantees that it exists for all finite-order rate-distortion functions and that the optimal $N$th order distribution is simply the product distribution of $N$ copies of the first-order optimal distribution.

Rose [26] proved that for a continuous input random variable and the squared error distortion, the Shannon optimal reproduction distribution will be (absolutely) continuous only in the special case where the Shannon lower bound to the rate-distortion function holds with equality, e.g., in the case of a Gaussian source and squared error distortion. In other cases, the optimum reproduction distribution is discrete, and for source distributions with bounded support (e.g., the uniform [0, 1) source), the Shannon optimal reproduction distribution will have finite support, that is, it will be describable by a probability mass function (PMF) with a finite domain. This last result is originally due to Fix [6]. Rose proposed an algorithm using a form of annealing which attempts to find the optimal finite alphabet directly by operating on the source distribution, avoiding the indirect path of first discretizing the input distribution and then performing a discrete Blahut algorithm, which is the approach inherent to the constrained alphabet rate-distortion theory and code design algorithm of Finamore and Pearlman. There is no proof that Rose's annealing algorithm actually converges to the optimal solution, but our numerical results support his arguments.
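
For readers unfamiliar with the discrete Blahut algorithm mentioned above, the following is a minimal Python sketch of the textbook iteration for a finite source alphabet and a finite reproduction alphabet. It is neither Rose's annealing algorithm nor the design code of this paper; the function name `blahut`, the slope parameter `beta`, and the fixed iteration count are assumptions made for illustration.

```python
import math


def blahut(p, d, beta, iters=200):
    """Compute one point on the rate-distortion curve of a discrete
    memoryless source by the Blahut iteration.

    p:    source pmf (list of probabilities)
    d:    distortion matrix, d[x][y]
    beta: positive slope parameter selecting the point on the curve
    Returns (rate_in_bits, distortion, reproduction_pmf).
    """
    nx, ny = len(p), len(d[0])
    q = [1.0 / ny] * ny  # initial reproduction pmf
    for _ in range(iters):
        # Test channel Q(y|x) proportional to q(y) * exp(-beta * d(x, y)).
        Q = []
        for x in range(nx):
            w = [q[y] * math.exp(-beta * d[x][y]) for y in range(ny)]
            s = sum(w)
            Q.append([wy / s for wy in w])
        # Reproduction pmf induced by the source and the test channel.
        q = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
    dist = sum(p[x] * Q[x][y] * d[x][y] for x in range(nx) for y in range(ny))
    rate = sum(p[x] * Q[x][y] * math.log2(Q[x][y] / q[y])
               for x in range(nx) for y in range(ny) if Q[x][y] > 0)
    return rate, dist, q


# Binary equiprobable source with Hamming distortion.
rate, dist, q = blahut([0.5, 0.5], [[0, 1], [1, 0]], beta=2.0)
```

For the binary equiprobable source with Hamming distortion, the returned pair can be checked against the known closed form $R(D) = 1 - h(D)$, where $h$ is the binary entropy function.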

III. Necessary Conditions for Optimal and 
Asymptotically Optimal Codes 

A sliding-block code $(f, g)$ for source coding is said to be optimum if it yields an average distortion equal to the operational distortion-rate function, $D(f, g) = \delta_X(R)$. Unlike the simple scalar quantizer case (or the nonstationary vector quantizer case), however, there are no simple conditions for guaranteeing the existence of an optimal code. Hence usually it is of greater interest to consider codes that are asymptotically optimal in the sense that their performance approaches the optimal in the limit, but there might not be a code which actually achieves the limit. More precisely, a sequence of rate-$R$ sliding-block codes $f_n, g_n$, $n = 1, 2, \ldots$, for source coding is asymptotically optimal (a.o.) if

$$\lim_{n \to \infty} D(f_n, g_n) = \delta_X(R) = D_X(R). \tag{13}$$

An optimal code (when it exists) is trivially asymptotically optimal and hence any necessary condition for an asymptotically optimal sequence of codes also applies to a fixed code that is optimal by simply equating every code in the sequence to the fixed code.

Similarly, a simulation code $g$ is optimal if $\bar d(\mu_X, \mu_{\bar g(Z)}) = \Delta_{X|Z}(R)$ and a sequence of codes $g_n$ is asymptotically optimal if

$$\lim_{n \to \infty} \bar d(\mu_X, \mu_{\bar g_n(Z)}) = \Delta_{X|Z}(R). \tag{14}$$

In this section we exclusively focus on the squared error distortion and assume that the real-valued stationary and ergodic process $X = \{X_n\}$ has finite variance.

A. Process approximation 

The following lemma provides necessary conditions for asymptotically optimal codes which are a slight generalization and elaboration of Theorem 1 of Gray and Linder [14]. A proof is provided in the Appendix.

Lemma 2: (Condition 1) Given a real-valued stationary ergodic process $X$, suppose that $f_n, g_n$, $n = 1, 2, \ldots$ is an asymptotically optimal sequence of stationary source codes for $X$ with encoder output/decoder input alphabet $B$ of size $\|B\| = 2^R$. Denote the resulting reproduction processes by $\hat X^{(n)}$ and the $B$-ary encoder output/decoder input processes by $U^{(n)}$. If $D_X(R) > 0$, then

$$\lim_{n \to \infty} \bar d_2(\mu_X, \mu_{\hat X^{(n)}}) = D_X(R),$$

$$\lim_{n \to \infty} H(\hat X^{(n)}) = \lim_{n \to \infty} H(U^{(n)}) = R,$$

$$\lim_{n \to \infty} \bar d_0(U^{(n)}, Z) = 0,$$

where $Z$ is an IID equiprobable process with alphabet size $2^R$.

These properties are quite intuitive: 

• The process distance between a source and an approx- 
imately optimal reproduction of entropy rate less than 
R is close to the Shannon distortion rate function. Thus 
frequency-typical sequences of the reproduction should 
be as close as possible to frequency-typical source se- 
quences. 

• The entropy rate of an approximately optimal reproduction and of the resulting encoded $B$-ary process must be near the maximum possible value.

• The sequence of encoder output processes approaches an IID equiprobable source in the Ornstein process distance. If $R = 1$, the encoder output bits should look like fair coin flips.

If $X$ is a $B$-process, then a sequence of a.o. simulation codes $g_n$ yielding reproduction processes $\hat X^{(n)}$ satisfies $\lim_{n \to \infty} \bar d_2(\mu_X, \mu_{\hat X^{(n)}}) = \Delta_{X|Z}(R) = D_X(R)$ and a similar argument to the proof of the previous lemma implies that

$$\lim_{n \to \infty} H(\hat X^{(n)}) = H(Z) = R.$$



B. Moment conditions 

The next set of necessary conditions concerns the squared error distortion and resembles a standard result for scalar and vector quantizers (see, e.g., [8], Lemmas 6.2.2 and 11.2.2). The proof differs, however, in that in the quantization case the centroid property is used, while here simple ideas from linear prediction theory accomplish a similar goal. Define in the usual way the covariance $\mathrm{COV}(X, Y) = E[(X - E(X))(Y - E(Y))]$.

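Lemma 3: (Condition 2) Given a real-valued stationary ergodic process $X$, suppose that $f_n, g_n$ is an asymptotically optimal sequence of codes (with respect to squared error) yielding reproduction processes $\hat X^{(n)}$ with entropy rate $H(\hat X^{(n)}) \le R$. Then

$$\lim_{n \to \infty} E(\hat X_0^{(n)}) = E(X_0), \tag{15}$$

$$\lim_{n \to \infty} \frac{\mathrm{COV}(X_0, \hat X_0^{(n)})}{\sigma^2_{\hat X_0^{(n)}}} = 1, \tag{16}$$

$$\lim_{n \to \infty} \sigma^2_{\hat X_0^{(n)}} = \sigma^2_{X_0} - D_X(R). \tag{17}$$

Defining the error as $\epsilon_0^{(n)} = \hat X_0^{(n)} - X_0$, the necessary conditions become

$$\lim_{n \to \infty} E(\epsilon_0^{(n)}) = 0, \tag{18}$$

$$\lim_{n \to \infty} E(\epsilon_0^{(n)} \hat X_0^{(n)}) = 0, \tag{19}$$

$$\lim_{n \to \infty} \sigma^2_{\epsilon_0^{(n)}} = D_X(R). \tag{20}$$

The results are stated for time $k = 0$, but stationarity ensures that they hold for all times $k$.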

Proof: For any encoder/decoder pair $(f_n, g_n)$ yielding a reproduction process $\hat X^{(n)}$,

$$D(f_n, g_n) \ge \inf_{a,b} D(f_n, a g_n + b) \ge D_X(R) = \inf_{f,g} D(f, g),$$

where the second inequality follows since scaling a sliding-block decoder by a real constant and adding a real constant results in another sliding-block decoder with entropy rate no greater than that of the input. The minimization over $a$ and $b$ for each $n$ is solved by standard linear prediction techniques as

$$a_n = \frac{\mathrm{COV}(X_0, \hat X_0^{(n)})}{\sigma^2_{\hat X_0^{(n)}}}, \tag{21}$$

$$b_n = E(X_0) - a_n E(\hat X_0^{(n)}), \tag{22}$$

$$\inf_{a,b} D(f_n, a g_n + b) = D(f_n, a_n g_n + b_n) = \sigma^2_{X_0} - a_n^2 \sigma^2_{\hat X_0^{(n)}}. \tag{23}$$
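
The linear prediction step in (21)-(23) can be checked numerically. The sketch below is an illustration only: it replaces expectations by sample averages over a hypothetical record of source values `x` and decoder outputs `xhat`, and the function name `best_affine_decoder` is an assumption, not notation from the paper.

```python
def best_affine_decoder(x, xhat):
    """Return (a, b, mse) minimizing the sample mean of
    (x - (a * xhat + b)) ** 2 over real a and b.

    Sample moments stand in for the expectations in (21)-(23).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_h = sum(xhat) / n
    cov = sum((u - mean_x) * (v - mean_h) for u, v in zip(x, xhat)) / n
    var_h = sum((v - mean_h) ** 2 for v in xhat) / n
    var_x = sum((u - mean_x) ** 2 for u in x) / n
    a = cov / var_h                # scaling, as in (21)
    b = mean_x - a * mean_h        # offset, as in (22)
    mse = var_x - a * a * var_h    # resulting distortion, as in (23)
    return a, b, mse


# Hypothetical record: a badly scaled one-bit reproduction of four samples.
x = [-1.5, -0.5, 0.5, 1.5]
xhat = [-2.0, -2.0, 2.0, 2.0]
a, b, mse = best_affine_decoder(x, xhat)

# Distortion before and after rescaling the decoder output.
mse_before = sum((u - v) ** 2 for u, v in zip(x, xhat)) / len(x)
mse_after = sum((u - (a * v + b)) ** 2 for u, v in zip(x, xhat)) / len(x)
```

In this toy record the rescaled decoder strictly reduces the distortion, which is the mechanism the proof exploits: for an asymptotically optimal sequence no such gain can survive in the limit, forcing $a_n \to 1$ and the means to agree.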



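Combining the above facts we have that since $(f_n, g_n)$ is an asymptotically optimal sequence,

$$D_X(R) = \lim_{n \to \infty} D(f_n, g_n) \ge \lim_{n \to \infty} \inf_{a,b} D(f_n, a g_n + b) \ge D_X(R) \tag{24}$$

and hence that both inequalities are actually equalities. The final inequality in (24) being an equality yields

$$\lim_{n \to \infty} a_n^2 \sigma^2_{\hat X_0^{(n)}} = \sigma^2_{X_0} - D_X(R). \tag{25}$$

Application of asymptotic optimality and (21) to

$$D(f_n, g_n) = E\left[\left([X_0 - E(X_0)] - [\hat X_0^{(n)} - E(\hat X_0^{(n)})] + [E(X_0) - E(\hat X_0^{(n)})]\right)^2\right] = \sigma^2_{X_0} + \sigma^2_{\hat X_0^{(n)}} - 2\,\mathrm{COV}(X_0, \hat X_0^{(n)}) + [E(X_0) - E(\hat X_0^{(n)})]^2$$

results in

$$D_X(R) = \lim_{n \to \infty} \left( \sigma^2_{X_0} + (1 - 2a_n)\sigma^2_{\hat X_0^{(n)}} + [E(X_0) - E(\hat X_0^{(n)})]^2 \right). \tag{26}$$

Subtracting (25) from (26) yields

$$\lim_{n \to \infty} \left( (1 - a_n)^2 \sigma^2_{\hat X_0^{(n)}} + [E(X_0) - E(\hat X_0^{(n)})]^2 \right) = 0. \tag{27}$$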

Since both terms in the limit are nonnegative, both must converge to zero since the sum does. Convergence of the rightmost term in the sum proves (15). Provided $D_X(R) < \sigma^2_{X_0}$, which is true if $R > 0$, (25) and (27) together imply that $(a_n - 1)^2 / a_n^2$ converges to 0 and hence that

$$\lim_{n \to \infty} a_n = \lim_{n \to \infty} \frac{\mathrm{COV}(X_0, \hat X_0^{(n)})}{\sigma^2_{\hat X_0^{(n)}}} = 1. \tag{28}$$

This proves (16) and with (26) proves (17) and also that

$$\lim_{n \to \infty} \mathrm{COV}(X_0, \hat X_0^{(n)}) = \sigma^2_{X_0} - D_X(R). \tag{29}$$

Finally consider the conditions in terms of the reproduction error. Eq. (18) follows from (15). Eq. (19) follows from (15)-(29) and some algebra. Eq. (20) follows from (18) and the asymptotic optimality of the codes. □

If $X$ is a $B$-process so that $\Delta_{X|Z}(R) = D_X(R)$, then a similar proof yields corresponding results for the simulation problem. If $g_n$ is an asymptotically optimal (with respect to $\bar d_2$ distance) sequence of stationary codes of an IID equiprobable source $Z$ with alphabet $B$, $R = \log \|B\|$, which produce a simulated process $\hat X^{(n)}$, then

$$\lim_{n \to \infty} E(\hat X_0^{(n)}) = E(X_0),$$

$$\lim_{n \to \infty} \sigma^2_{\hat X_0^{(n)}} = \sigma^2_{X_0} - \Delta_{X|Z}(R).$$

C. Finite-order distribution Shannon conditions for IID processes

Several code design algorithms, including randomly populating a trellis to mimic the proof of the trellis source encoding theorem [34], are based on the intuition that the guiding principle of designing such a system for an IID source should be to produce a code with marginal reproduction distribution close to a Shannon optimal reproduction distribution [35], [23]. While this intuition is compelling, we are not aware of any rigorous demonstration to the effect that if a code is asymptotically optimal, then necessarily its marginal reproduction distribution approaches that of a Shannon optimal. Pearlman [23] was the first to formally conjecture this property of sliding-block codes. The following result addresses this issue. It follows from standard inequalities and Csiszár [3] as summarized in Corollary 1.

Lemma 4: (Condition 3a) Given a real-valued IID process $X$ with distribution $\mu_X$, assume that $f_n, g_n$ is an asymptotically optimal sequence of stationary source encoder/decoder pairs with common encoder output/decoder input alphabet $B$, $R = \log \|B\|$, which produce a reproduction process $\hat X^{(n)}$. Then a subsequence of the marginal distribution of the reproduction process, $\mu_{\hat X_0^{(n)}}$, converges weakly and in $T_2$ to a Shannon optimal reproduction distribution. If the Shannon optimal reproduction distribution is unique, then $\mu_{\hat X_0^{(n)}}$ converges to it.

Proof: Given the asymptotically optimal sequence of codes, let $\pi_n$ denote the induced process joint distributions on $(X, \hat X^{(n)})$, with $\pi_n^N$ the induced $N$-dimensional joint distributions. The encoded process has alphabet size $2^R$ and hence entropy rate less than or equal to $R$. Since coding cannot increase entropy rate, the entropy rate of the reproduction (decoded) process is also less than or equal to $R$. By standard information theoretic inequalities (e.g., [11], p. 193), since the input process is IID we have for all $N$ that

$$\frac{1}{N} I(\pi_n^N) = \frac{1}{N} I(X^N, \hat X^{(n),N}) \ge \frac{1}{N} \sum_{i=0}^{N-1} I(X_i, \hat X_i^{(n)}) = I(X_0, \hat X_0^{(n)}). \tag{30}$$

The leftmost term converges to the mutual information rate between the input and reproduction, which is bounded above by the entropy rate of the output so that $I(X_0, \hat X_0^{(n)}) \le R$, all $n$. Since the code sequence is asymptotically optimal, (13) holds. Thus the sequence of joint distributions $\pi_n$ for $(X_0, \hat X_0^{(n)})$ meets the conditions of Corollary 1 and hence $\mu_{\hat X_0^{(n)}}$ has a subsequence which converges weakly to a Shannon optimal distribution. If the Shannon optimal distribution $\mu_{Y_0}$ is unique, then every subsequence of $\mu_{\hat X_0^{(n)}}$ has a further subsequence which converges to $\mu_{Y_0}$, which implies that $\mu_{\hat X_0^{(n)}}$ converges weakly to $\mu_{Y_0}$. The moment conditions (15) and (17) of Lemma 3 imply that $E[(\hat X_0^{(n)})^2]$ converges to $E[(Y_0)^2]$. The weak convergence of a subsequence of $\mu_{\hat X_0^{(n)}}$ (or the sequence itself) and the convergence of the second moments imply convergence in $T_2$ [33]. □

Since the source is IID, the $N$-fold product of a one-dimensional
Shannon optimal distribution is an $N$-dimensional
Shannon optimal distribution. If the Shannon optimal marginal
distribution is unique, then so is the $N$-dimensional Shannon
optimal distribution. Since Csiszár's [3] results hold for the
$N$-dimensional case, we immediately have the first part of the
following corollary.

Corollary 2: (Condition 3b) Given the assumptions of the
lemma, for any positive integer $N$ let $\mu^N_{\hat{X}^{(n)}}$ denote the
$N$-dimensional joint distribution of the reproduction process
$\hat{X}^{(n)}$. Then a subsequence of the $N$-dimensional reproduction
distribution $\mu^N_{\hat{X}^{(n)}}$ converges weakly and in $T_2$ to the $N$-fold
product of a Shannon optimal marginal distribution (and
hence to an $N$-dimensional Shannon optimal distribution). If
the one-dimensional Shannon optimal distribution is unique,
then $\mu^N_{\hat{X}^{(n)}}$ converges weakly and in $T_2$ to its $N$-fold product
distribution.

Proof: The moment conditions (15) and (17) of Lemma 3
imply that $E[(\hat{X}_k^{(n)})^2]$ converges to $E[(Y_k)^2]$ for
$k = 0, 1, \ldots, N-1$. The weak convergence of the $N$-dimensional
distribution of a subsequence of $\mu^N_{\hat{X}^{(n)}}$ (or the sequence
itself) and the convergence of the second moments imply
convergence in $T_2$ [33]. □

There is no counterpart of this result for optimal codes 
as opposed to asymptotically optimal codes. Consider the 
Gaussian case where the Shannon optimal distribution is a 
product Gaussian distribution with variance $\sigma_X^2 - D_X(R)$. If
a code were optimal, then for each $N$ the resulting $N$th order
reproduction distribution would have to equal the Shannon
product distribution. But if this were true for all $N$, the
reproduction would have to be the IID process with the 
Shannon marginals, but that process has infinite entropy rate. 

If TV is a 73-process, then a small variation on the proof 
yields similar results for the simulation problem: given an 
IID target source $X$, the $N$th order joint distributions
of an asymptotically optimal sequence of constrained rate
simulations will have a subsequence that converges
weakly and in $T_2$ to an $N$-dimensional Shannon optimal
distribution.



D. Asymptotic uncorrelation 

The following lemma proves a result that has often been
assumed or claimed to be a property of optimal codes. Define
as usual the covariance function of the stationary process $\hat{X}^{(n)}$
by $K_{\hat{X}^{(n)}}(k) = \mathrm{COV}(\hat{X}_i^{(n)}, \hat{X}_{i-k}^{(n)})$ for all integer $k$.

Lemma 5: (Condition 4) Given a real-valued IID process $X$
with distribution $\mu_X$, assume that $f_n, g_n$ is an asymptotically
optimal sequence of stationary source encoder/decoder pairs
with common alphabet $B$ of size $R = \log \|B\|$ which produce
a reproduction process $\hat{X}^{(n)}$. For all $k \ne 0$,

$$\lim_{n\to\infty} K_{\hat{X}^{(n)}}(k) = 0 \tag{31}$$

and hence the reproduction processes are asymptotically uncorrelated.

Proof. If the Shannon optimal distribution is unique, then
$\mu^N_{\hat{X}^{(n)}}$ converges in $T_2$ to the $N$-fold product of the Shannon
optimal marginal distribution by Corollary 2. As Lemma 6
in the Appendix shows, this implies the convergence of
$K_{\hat{X}^{(n)}}(k) = \mathrm{COV}(\hat{X}_k^{(n)}, \hat{X}_0^{(n)})$ to 0 for all $k \ne 0$. □

Taken together these necessary conditions provide straight- 
forward tests for code construction algorithms. Ideally, one 
would like to prove that a given code construction satisfies 
these properties, but so far this has only proved possible for 
the Shannon optimal reproduction distribution property — as 
exemplified in the next section. The remaining properties, 
however, can be easily demonstrated numerically. 

IV. An Algorithm for Sliding-Block Simulation 
and Source Decoder Design 

We begin with a sliding-block simulation code which ap- 
proximately satisfies the Shannon marginal distribution neces- 
sary condition for optimality. Matching the code with a Viterbi 
algorithm (VA) encoder then yields a trellis source encoding 
system. 

A. Sliding-block simulation code/source decoder 

Consider a sliding-block code $g_L$ of length $L$ of an
equiprobable binary IID process $Z$ which produces an output
process $\hat{X}$ defined by

$$\hat{X}_n = g_L(Z_n, Z_{n-1}, \ldots, Z_{n-L+1}), \tag{32}$$

where the notation makes sense even if $L$ is infinite, in
which case $g$ views a semi-infinite binary sequence. Since
the processes are stationary, we emphasize the case $n = 0$.
Suppose that the ideal distribution for $\hat{X}_0$ is given by a
CDF $F$, for example the CDF corresponding to the Shannon
optimal marginal reproduction distribution of Lemma 1. Given
a CDF $F$, define the (generalized) inverse CDF $F^{-1}$ as
$F^{-1}(u) = \inf\{r : F(r) \ge u\}$ for $0 < u < 1$. If $U$
is a uniformly distributed continuous random variable on
$(0, 1)$, then the random variable $F^{-1}(U)$ has CDF $F$. The
CDF can be approximated by considering the binary $L$-tuple
$u^L = (u_0, u_1, \ldots, u_{L-1})$ comprising the shift register entries
as the binary expansion of a number in $(0, 1)$:

$$b(u^L) = \sum_{i=0}^{L-1} u_i 2^{-i-1} + 2^{-L-1}, \tag{33}$$

and defining

$$g(Z_n, Z_{n-1}, \ldots, Z_{n-L+1}) = F^{-1}(b(Z_n, Z_{n-1}, \ldots, Z_{n-L+1})). \tag{34}$$

If $Z_n$ is a fair coin flip process, the discrete random
variable $b(Z_n, Z_{n-1}, \ldots, Z_{n-L+1})$ is uniformly distributed
on the discrete set $\{2^{-L-1}, 2^{-L-1} + 2^{-L}, 2^{-L-1} + 2 \times 2^{-L}, \ldots, 2^{-L-1} + 1 - 2^{-L}\}$,
that is, it is a discrete approximation
to a uniform $(0, 1)$ that improves as $L$ grows, and the
distribution of $g(Z_n, Z_{n-1}, \ldots, Z_{n-L+1})$ converges weakly
to $F$, satisfying a necessary condition for an asymptotically
optimal sequence of codes. If $L$ is infinite, then the marginal
distribution will correspond to the target distribution exactly.
This fulfills the necessary condition of weak convergence for
an asymptotically optimal code of Lemma 4.
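The table defined by (33) and (34) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the names `b` and `decoder_table`, the choice $L = 12$, and the use of the standard-library `NormalDist` for the Gaussian case (unit-variance source at $R = 1$, so the target variance is $1 - 2^{-2R} = 0.75$) are all choices made here.

```python
import math
from statistics import NormalDist

def b(bits):
    """Map a binary L-tuple (u_0, ..., u_{L-1}) to a cell midpoint in (0, 1), as in (33)."""
    L = len(bits)
    return sum(u * 2.0 ** (-i - 1) for i, u in enumerate(bits)) + 2.0 ** (-L - 1)

def decoder_table(L, inv_cdf):
    """Tabulate g = F^{-1}(b(.)) over all 2^L shift-register contents, as in (34)."""
    table = []
    for idx in range(2 ** L):
        # u_0 (the newest bit) is taken as the most significant bit of idx
        bits = [(idx >> (L - 1 - i)) & 1 for i in range(L)]
        table.append(inv_cdf(b(bits)))
    return table

# Assumed example: unit-variance Gaussian source at R = 1 bit per sample
R = 1
target = NormalDist(mu=0.0, sigma=math.sqrt(1.0 - 2.0 ** (-2 * R)))
L = 12
g = decoder_table(L, target.inv_cdf)

# Under fair coin flips every table entry is equally likely, so the marginal
# moments of the decoder output are plain averages over the table.
mean = sum(g) / len(g)
var = sum((y - mean) ** 2 for y in g) / len(g)
print(f"levels = {len(g)}, mean = {mean:.2e}, variance = {var:.4f}")
```

The printed variance should sit slightly below 0.75 because the finite table truncates the tails; increasing `L` tightens the approximation, in line with the weak convergence argument above.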

The code as described thus far only provides the correct
approximate marginals; it does not provide joint distributions
that match the Shannon optimal joint distribution, nor can
it do so exactly since it cannot produce independent pairs. We adopt
a heuristic aimed at making pairs of reproduction samples
as independent as possible by modifying the code in a way
that decorrelates successive reproductions and hence attempts
to satisfy the necessary condition of Lemma 5. Instead of
applying the inverse CDF directly to the binary shift register
contents, we first permute the binary vectors, that is, the
codebook of all $2^L$ possible shift register contents is permuted
by an invertible one-to-one mapping $\mathcal{P} : \{0, 1\}^L \to \{0, 1\}^L$
and the binary vector $\mathcal{P}(u^L)$ is used to generate the discrete
uniform distribution. A randomly chosen permutation $\mathcal{P}$ is
used, but once chosen it is fixed so that the sliding-block decoder
is truly stationary. Such a random choice to obtain a code
that is then used for all time is analogous to the traditional
Shannon block source coding proof of randomly choosing a
decoder codebook which is then used for all time. Thus our
decoder is

$$g(Z_n, Z_{n-1}, \ldots, Z_{n-L+1}) = F_{Y_0}^{-1}(b(\mathcal{P}(Z_n, Z_{n-1}, \ldots, Z_{n-L+1}))), \tag{35}$$

where $F_{Y_0}(y)$ is a Shannon optimal reproduction distribution
obtained either analytically (as in the Gaussian case) or from
the Rose algorithm (to find the optimum finite support).
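The decorrelating effect of the permutation can be checked numerically with a small simulation. The sketch below is an assumption-laden illustration rather than the authors' code: the helper names, the seed, $L = 10$, the sequence length, and the Gaussian target with variance 0.75 (unit-variance source at $R = 1$) are chosen here, and `random.shuffle` stands in for "a randomly chosen permutation".

```python
import math
import random
from statistics import NormalDist

def make_table(L, inv_cdf, perm=None):
    """g[s] = F^{-1}(b(P(s))); with s read as an integer, b(s) = (s + 0.5) / 2^L."""
    size = 2 ** L
    table = []
    for s in range(size):
        p = perm[s] if perm is not None else s   # identity when no permutation is given
        table.append(inv_cdf((p + 0.5) / size))
    return table

def simulate(table, L, bits):
    """Slide an L-bit register over the input bits; the newest bit enters as the MSB."""
    s, out = 0, []
    for n, z in enumerate(bits):
        s = (z << (L - 1)) | (s >> 1)
        if n >= L - 1:                           # skip outputs until the register is full
            out.append(table[s])
    return out

def lag1_corr(x):
    """Sample correlation coefficient between adjacent outputs."""
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1)) / (len(x) - 1)
    return cov / var

rng = random.Random(0)
L = 10
inv_cdf = NormalDist(0.0, math.sqrt(0.75)).inv_cdf
perm = list(range(2 ** L))
rng.shuffle(perm)                                # fixed once, then used for all time
bits = [rng.getrandbits(1) for _ in range(50_000)]

rho_plain = lag1_corr(simulate(make_table(L, inv_cdf), L, bits))
rho_perm = lag1_corr(simulate(make_table(L, inv_cdf, perm), L, bits))
print(f"lag-1 correlation: no permutation {rho_plain:.3f}, random permutation {rho_perm:.3f}")
```

Without the permutation, adjacent register contents share all but one bit, so adjacent outputs are strongly correlated; with a fixed random permutation the sample correlation should be close to zero, although nothing in this sketch proves that it vanishes.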

Intuitively, the permutation should make the resulting sequence
of arguments of the mapping (the number in $(0, 1)$
constructed from the permuted binary symbols) resemble an
independent sequence and hence cause the sequence of branch
labels to locally appear to be independent. The goal is to
satisfy the necessary conditions on joint reproduction distributions
of Corollary 2, but we have no proof that the proposed
construction has this property. The experimental results to
be described show excellent performance approaching the
Shannon rate-distortion bound and show that the branch labels
are indeed uncorrelated. The permutation is implemented
easily by permuting the table entries defining $g$. For the
constrained-rate simulation problem, the permutation does not
change the marginal distribution of the coder output, which
still converges weakly to the Shannon optimal reproduction
distribution as $L \to \infty$, even in the Gaussian case. This approach
is in the spirit of Rose's mapping approach to finding the
rate-distortion function [26] since it involves discretizing a
continuous uniform random variable which is the argument to
a mapping into the reproduction space, rather than discretizing
the source.

The decoder design involves no training (assuming that the 
Shannon optimal marginal distribution is known). 



B. Trellis encoding 

If the decoder of a source coding system is a finite-length 
sliding-block code, then encoding can be accomplished using 
a VA search of the trellis diagram labeled by the available de- 
coder outputs. A trellis is a directed graph showing the action 
of a finite-state machine with all but the newest symbol in 
the shift register constituting the state and the newest symbol 
being the input. Branches connecting each state are labeled 
by the output (or an index for the output in a reproduction 
codebook) produced by receiving a specific input in a given 
state. As usually implemented, the VA yields a block encoder 
matched to the sliding-block decoder. A source coding system 
having this form is a trellis source encoding system. 
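A full-search Viterbi encoder matched to a table-driven sliding-block decoder can be sketched as follows for rate 1 bit per sample. This is a minimal illustration, not the authors' implementation: the function names, the small register length $L = 6$, the short test sequence, and the convention that any initial state is allowed at zero cost are assumptions made here to keep the example brief.

```python
import math
import random
from statistics import NormalDist

def make_table(L, inv_cdf, perm):
    """Branch labels g[reg] = F^{-1}((perm[reg] + 0.5) / 2^L)."""
    size = 2 ** L
    return [inv_cdf((perm[reg] + 0.5) / size) for reg in range(size)]

def viterbi_encode(x, table, L):
    """Minimum squared-error path through a trellis with 2^(L-1) states.

    The register is reg = (newest bit, state); the next state drops the oldest bit.
    Returns (bits, reproduction, mse).
    """
    n_states = 2 ** (L - 1)
    cost = [0.0] * n_states                  # assumption: free choice of initial state
    back = []
    for sample in x:
        new_cost = [math.inf] * n_states
        ptr = [(0, 0)] * n_states
        for state in range(n_states):
            for z in (0, 1):
                reg = (z << (L - 1)) | state
                nxt = reg >> 1
                c = cost[state] + (sample - table[reg]) ** 2
                if c < new_cost[nxt]:        # keep the survivor path into nxt
                    new_cost[nxt] = c
                    ptr[nxt] = (state, z)
        cost = new_cost
        back.append(ptr)
    state = min(range(n_states), key=cost.__getitem__)
    total = cost[state]
    bits, recon = [], []
    for ptr in reversed(back):               # trace the best path backwards
        prev, z = ptr[state]
        bits.append(z)
        recon.append(table[(z << (L - 1)) | prev])
        state = prev
    bits.reverse()
    recon.reverse()
    return bits, recon, total / len(x)

rng = random.Random(1)
L = 6
perm = list(range(2 ** L))
rng.shuffle(perm)
table = make_table(L, NormalDist(0.0, math.sqrt(0.75)).inv_cdf, perm)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
bits, recon, mse = viterbi_encode(x, table, L)
print(f"L = {L}, MSE = {mse:.4f}")
```

The work per sample is proportional to the number of branches, $2^L$, which is why the register lengths reported later are limited by computation rather than by decoder design.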

The theoretical properties of asymptotically optimal codes 
developed here are for the combination of stationary encoder 
and decoder, but our numerical results use the traditional trellis 
source encoding structure of a block VA matched to a sliding- 
block decoder. In fact, we perform a full search on the entire 
test sequence since this provides the smallest possible average 
distortion encoding using the given decoder. This apparent 
mismatch of a theoretical emphasis on overall stationary 
codes with a hybrid stationary decoder/block encoder merits 
explanation. First, our emphasis is on decoder design and given 
a sliding-block decoder, no encoder can yield smaller average 
distortion than a matched VA operating on the entire
dataset. Available computers permit such an implementation 
for datasets and decoders of interesting size. A source coding 
theorem for a block Viterbi encoder and a stationary decoder 
may be found in [10]. Second, using standard techniques
for converting a block code into a sliding-block code, a VA 
block encoder can be approximated as closely as desired 
by a sliding-block code. Such approximations originate in 
Ornstein's proof of his isomorphism theorem [21], [22] and 
have been developed specifically for tree and trellis encoding 
systems, e.g., in Section VII of [9], and for block source
codes in general in [28], [11]. These constructions embed
a good block code into a stationary structure by means of 
a punctuation sequence which inserts rare spacing between 
long blocks — which in practice would mean adding signif- 
icant computational complexity to the straightforward Viterbi 
search of the approximately optimal decoder output. Other, 
simpler, means of stationarizing the VA such as incremental 
tree and trellis encoding JT] , ifTUll have been considered, but 
they are not supported by coding theorems. Experimentally, 
however, they have been shown to provide essentially the 
same performance as the usual block Viterbi encoder. The 
hybrid code with a VA encoder and a stationary decoder 
remains the simplest implementation and takes full advantage 
of the stationary decoder which is designed here. Third, our 
necessary conditions for optimal stationary codes focus on 
the reproduction process and hence depend on the decoder 
and its correspondence to an optimal simulation code. The 
Associate Editor has pointed out that the theoretical results 
for stationary codes can likely be reconciled with our use of a 
block encoder/stationary decoder by extending our necessary 
conditions to incorporate hybrid codes such as fixed-rate (or 
variable-rate [36]) trellis encoding systems by replacing our
marginal distributions by average marginal distributions. We
suspect this is true and that our results will hold for any coding 
structure yielding asymptotically mean stationary processes, 
but we have chosen not to attempt this here in the interests of 
simplicity and clarity. 

A brief overview of the history of trellis source encoding 
provides useful context for comparing the numerical results. A 
stationary decoder produces a time-invariant trellis and trellis 
branch labels that do not change with time. The original 1974 
source coding theorem for trellis encoded IID sources [34] was
proved for time-varying codes by using a variation of Shannon 
random coding — successive levels of the trellis were labeled 
randomly based on IID random variables chosen according to 
the test channel output distribution arising in the evaluation of 
the Shannon rate-distortion function. 

Early research on trellis encoding design was concerned
with time-varying trellises, reflecting the structure of the
coding theorem. In particular, Wilson and Lytle [35] populated
their trellis using IID random labels chosen according to
the Shannon optimal reproduction distribution. A later source
coding theorem for time-invariant trellis encoding [10] was
based on the sliding-block source coding theorem [16],
and was purely an existence proof; it did not suggest any
implementable design techniques. Two early techniques for
time-invariant code design were the fake process design [17]
and a Lloyd clustering approach conditioned on the shift
register states [29], [30]. The former technique was based
on a heuristic argument involving optimal simulation and
the $\bar{d}$-distance formulation of the operational distortion rate
function. The idea was to color a trellis with a process as
close in $\bar{d}$ as possible to the original source. While the goal is
correct, the heuristic adopted to accomplish it was flawed: the
design attempted to match the marginal distribution and the
power spectral density of the reproduction with those of the
original source. As pointed out by Pearlman [23] and proved
in this paper, the marginal distribution of the trellis labels
should instead match the Shannon optimal distribution, not
the original source distribution.

Pearlman's theoretical development [23] was based on
his and Finamore's constrained-output alphabet rate-distortion
0, which involved a prequantization step prior to designing
a trellis encoder for the resulting finite-alphabet process.
Pearlman provided a coding theorem and an implementation 
for a time-invariant trellis encoding, but used the artifice 
of a subtractive dithering sequence to ensure the necessary 
independence of successive trellis branch labels over the code 
ensemble. Because of the dithering, the overall code is not 
time-invariant. 

Marcellin and Fischer in 1990 [18] introduced trellis-coded
quantization (TCQ) based on an analogy with coded modulation
in the dual problem of trellis decoding for noisy channels.
TCQ provided a coding technique of much reduced
complexity that has since become one of the most popular
compression systems for a variety of signals. The dual code
argument is strong only for the uniform case, but
variations of the idea have proved quite effective in a variety of
systems. TCQ has a default assignment of reproduction values
to trellis branches using a Lloyd-optimized quantizer, but the
levels can also be optimized.

Some techniques, including TCQ in our experiments, tend 
to reach a performance "plateau" in that performance im- 
provement with complexity becomes negligible well before 
the complexity becomes burdensome. In TCQ this can be 
attributed to constraints placed on the system to ensure low 
complexity. The technique introduced here has not (yet) shown 
any such plateau. 

More recently, van der Vleuten and Weber [32] combined
the fake process intuition with TCQ to obtain improved trellis
coding systems for IID sources. They incorrectly stated that
[17] had shown that a necessary condition for optimality for
trellis reproduction labels for coding an IID source is that the
reproduction process be uncorrelated (white) when the branch
labels are chosen in an equiprobable independent fashion.
This is indeed an intuitively desirable property and it was
used as a guideline in [17], but it was not shown to be
necessary. Eriksson et al. [4] used linear congruential (LC)
recursions to generate trellis labels and reproduction values
to develop the best codes to date for IID sources
by establishing a set of "axioms" of desirable properties
for good codes (including a flat reproduction spectrum) and
then showing that a trellis decoder based on an inverse CDF
of a sequence produced by linear recursion relations meets
the conditions. Because of the CDF matching and spectral
control, the system can also be viewed as a variation on the
fake process approach. Eriksson et al. observe that a problem
with TCQ is the constrained ability to increase alphabet size
for a fixed rate and they argue that larger alphabet size can
always help. This is not correct in general, although it is for
the Gaussian source where the Shannon optimal distribution
is continuous. For other sources, such as the uniform, the
Shannon optimal distribution has finite support and optimizing for an
alphabet that is too large or not the correct one will hurt in
general. As with TCQ, the approach allowed optimization of
the reproduction values assigned to trellis branch labels.

V. Numerical Examples 

The random permutation trellis encoder was designed for
three common IID test sources: Gaussian, uniform, and Laplacian.
The results in terms of both mean squared error (MSE)
and signal-to-noise ratio (SNR) are reported for various shift
register lengths $L$, indicated by RP_L, where RP_L stands for the
random permutation trellis coding algorithm with shift register
length $L$. The test sequences were all of length $10^6$. The results
for the Gaussian, uniform, and Laplacian sources are shown in
Tables I, II, and III, respectively.

Each test result is from one random permutation; repeating
the test with different random permutations has produced
almost identical results. For example, for the IID Gaussian source
with $R = 1$ and $L = 16$, a total of 20 test runs returned MSE values
between 0.2629 and 0.2643, with an average of 0.2634.

The distortion-rate functions $D_X(R)$ for all three sources
are also listed in the tables. For the uniform and Laplacian
sources, the values of $D_X(R)$ are numerical estimates produced by the
Rose algorithm; in both cases, the reported distortions are
slightly lower than the results reported in [18],
GDI calculated using the Blahut algorithm [2|. 
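The tabulated MSE and SNR columns appear to be related by $\mathrm{SNR} = 10 \log_{10}(\sigma_X^2 / \mathrm{MSE})$, with $\sigma_X^2 = 1$ for the Gaussian and Laplacian sources and $\sigma_X^2 = 1/12$ for the uniform source on an interval of length 1. The short Python check below rests on that assumed definition; the function names are illustrative, and the closed form $D_X(R) = \sigma_X^2 2^{-2R}$ applies only to the Gaussian source.

```python
import math

def gaussian_distortion_rate(variance, rate):
    """Shannon distortion-rate function of an IID Gaussian source under MSE."""
    return variance * 2.0 ** (-2.0 * rate)

def snr_db(variance, mse):
    """Signal-to-noise ratio in dB, assuming SNR = 10 log10(variance / MSE)."""
    return 10.0 * math.log10(variance / mse)

d_gauss = gaussian_distortion_rate(1.0, 1)
print(d_gauss)                                   # 0.25
print(round(snr_db(1.0, d_gauss), 2))            # Gaussian Shannon limit at R = 1
print(round(snr_db(1.0, 0.2634), 2))             # average Gaussian MSE for L = 16 quoted above
print(round(snr_db(1.0 / 12.0, 0.0173), 2))      # uniform D_X(1) entry of Table II
print(round(snr_db(1.0, 0.2166), 2))             # Laplacian D_X(1) entry of Table III
```

Because the tabulated MSE values are rounded, the recomputed SNR can differ from the tabulated SNR in the last digit.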






The rate $R = 1$ results of the random permutation trellis
coder are compared to previous results of the linear
congruential trellis codes (LC) of Eriksson, Anderson, and
Goertz [4], trellis coded quantization (TCQ) by Marcellin and
Fischer [18], trellis source encoding by Pearlman [23] based
on constrained reproduction alphabets and matching the Shannon
optimal marginal distribution, a Lloyd-style clustering
algorithm conditioned on trellis states by Stewart et al. [29],
[30], and the Linde-Gray fake process design [17]. The rate
$R = 2$ results are compared with Eriksson et al.'s LC codes
and Marcellin and Fischer's TCQ. The rate $R = 3, 4$ results
are compared with TCQ, which provides the only available previous
results for these rates.

Eriksson et al.'s LC codes use 512 states for $R = 1$ and
256 states for $R = 2$, which are equivalent to shift register
length $L = 10$ in both cases. Marcellin's TCQ uses 256 states
for all rates, corresponding to shift register lengths 9, 10, 11, 12
for rates 1, 2, 3, 4, respectively. Pearlman's results and Stewart's
results are for $L = 10$, and Linde/Gray uses a shift register of
length 9. The shift register length $L$ is indicated as a subscript
for all results.

In the Gaussian example, there are $2^L$ reproduction levels in
the random permutation codes, the result of taking the inverse
Shannon optimal CDF, that of a Gaussian zero mean random
variable with variance $1 - D_X(R)$, and evaluating it at $2^L$
uniformly spaced numbers in the unit interval. For the uniform
source, there are 3, 6, 12, and 24 reproduction points for rates
1, 2, 3, 4 bits chosen by the Rose algorithm for evaluating the
first order rate-distortion function. Similarly, for the Laplacian
source, there are 9, 17, 31, and 55 reproduction points for rates
1, 2, 3, 4 bits, respectively. For rates $R = 2, 3, 4$ bits, the trellis
has $2^R$ outgoing branches from each node and $2^R$ incoming
branches to each node. $R$ new bits are shifted into the shift
register and $R$ old bits are shifted out at each transition. The
Viterbi search now merges $2^R$ paths at each node compared to
just 2 paths in the 1 bit case. The number of states in the trellis
is $2^{L-R}$. For trellis structures with the same number of
states, the $R = 2$ trellis has a shift register 1 bit longer
than the $R = 1$ trellis and also has twice the number
of branches/reproduction levels.
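The bookkeeping between shift register length, rate, and trellis size can be made concrete with a small helper. The function name `trellis_shape` and the dictionary keys are hypothetical; the formulas are the ones stated in the text ($2^{L-R}$ states, $2^R$ branches per node, $2^L$ branch labels).

```python
def trellis_shape(L, R):
    """Sizes of a trellis driven by an L-bit shift register at R bits per sample."""
    if not 1 <= R <= L:
        raise ValueError("need 1 <= R <= L")
    return {
        "states": 2 ** (L - R),            # all but the R newest bits form the state
        "branches_per_state": 2 ** R,      # one branch per possible R-bit input
        "branch_labels": 2 ** L,           # one reproduction level per register content
    }

# Register lengths quoted in the text for the LC codes and for TCQ
for L, R in [(10, 1), (10, 2), (9, 1), (11, 3), (12, 4)]:
    print(L, R, trellis_shape(L, R))
```

For example, $L = 10$ gives 512 states at $R = 1$ and 256 states at $R = 2$, matching the LC configurations described above.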

Eriksson et al.'s LC codes use $2^{L-1}$ reproduction points and
the Linde/Gray fake process design uses $2^L$ reproduction
points; in both cases, the reproduction points are generated
by taking the inverse CDF of the source, evaluating it in
the unit interval, and then multiplying by a scaling factor.
Stewart also uses $2^L$ reproduction points, but the reproduction
points are obtained through an iterative Lloyd-style training
algorithm. Pearlman uses a simpler 4 symbol reproduction
alphabet, produced by the Blahut algorithm. Marcellin's TCQ
uses $2^{R+1}$ reconstruction symbols, which are the outputs of the
Lloyd-Max quantizer. In both LC codes and TCQ, numerical
optimization of the reproduction values was used to improve
the results. The optimized results for LC codes and TCQ are
listed in the tables with the notation "(opt)".

The TCQ_9 and TCQ(opt)_9 results are from Marcellin and
Fischer's TCQ paper [18]. The TCQ results at shift register
lengths 12, 16, 20, 24 are asterisked since they are from our own
implementation of TCQ following the descriptions in [18].
In our implementation, the default reproduction values, not
the optimized ones, were used. The TCQ results clearly
show a performance "plateau" as the shift register length
increases.

TABLE I
Gaussian Example

The effectiveness of the random permutation at forcing
higher order distributions to look more Gaussian is shown in
Fig. 1. The two-dimensional scatter plot for adjacent samples
with no permutation does not look Gaussian and is clearly
highly correlated. When a randomly chosen permutation is
used, the plot looks like a 2D Gaussian sample. In both plots,
the x and y axes are the values of the samples.

Fig. 2 shows the MSE of the random permutation trellis
coder for the IID Gaussian source at $R = 1$ with various shift register
lengths. The performance has not yet hit a plateau as the
shift register length increases.

The uniform IID source is of interest because it is simple, 
there is no exact formula for the rate-distortion function with 
respect to mean-squared error and hence it must be found by 
numerical means, and because one of the best compression 
algorithms, trellis-coded quantization (TCQ) is theoretically 
ideally matched to this example. So the example is an excellent 
one for demonstrating some of the issues raised here and for 
comparison with other techniques. 

The Rose algorithm yielded a Shannon optimal distribution
with an alphabet of size 3 for $R = 1$. The points and their
probabilities are shown in Table IV.

    y          0.2      0.5      0.8
    p_Y(y)     0.368    0.264    0.368

TABLE IV
Shannon Optimal Reproduction Distribution for the Uniform
(0, 1) Source

Plugging the distribution into the random permutation trellis
encoder led to a mapping $g$ of (0, 0.368) to 0.2, [0.368, 0.632]
to 0.5, and (0.632, 1) to 0.8.

For the Laplacian source of variance 1, the Rose algorithm
yielded a Shannon optimal distribution with an alphabet of
size 9 for the 1 bit case. The 9 reproduction points and their
probabilities are listed in Table V.

    y          ±4.6273   ±3.2828   ±2.1654   ±1.1063   0
    p_Y(y)     0.0014    0.0065    0.0285    0.1266    0.6740

TABLE V
Shannon Optimal Reproduction Distribution for the
Laplacian Source

    Code           Rate (bits)   MSE            SNR (dB)
    RP_8           1             0.0203         6.13
    RP_9           1             0.0195         6.30
    RP_10          1             0.0190         6.42
    RP_12          1             0.0184         6.55
    RP_16          1             0.0179         6.69
    RP_20          1             0.0176         6.75
    RP_24          1             0.0175         6.78
    RP_28          1             0.0174         6.79
    D_X(R)         1             0.0173         6.84
    TCQ_9          1             0.0194         6.33
    TCQ(opt)_9     1             (illegible)    (illegible)
    LC_10          1             0.0191         6.40
    LC(opt)_10     1             0.0179         6.67

    RP_24          2             4.02e-03       13.17
    D_X(R)         2             3.96e-03       13.23
    TCQ_10         2             4.24e-03       12.93
    TCQ(opt)_10    2             4.18e-03       13.00
    LC(opt)_10     2             (illegible)    (illegible)
    RP_24          3             9.70e-04       19.34
    D_X(R)         3             9.46e-04       19.45
    TCQ_11         3             10.0e-04       19.20
    TCQ(opt)_11    3             9.95e-04       19.23
    RP_24          4             2.39e-04       25.43
    D_X(R)         4             2.35e-04       25.50
    TCQ_12         4             2.44e-04       25.34

TABLE II
Uniform [0, 1) Example

    Code           Rate (bits)   MSE       SNR (dB)
    RP_8           1             0.2946    5.31
    RP_9           1             0.2789    5.55
    RP_10          1             0.2671    5.73
    RP_12          1             0.2532    5.97
    RP_16          1             0.2384    6.23
    RP_20          1             0.2306    6.37
    RP_24          1             0.2266    6.45
    RP_28          1             0.2234    6.51
    D_X(R)         1             0.2166    6.64
    TCQ_9          1             0.3945    4.04
    TCQ(opt)_9     1             0.2793    5.54
    LC_10          1             0.2529    5.97
    LC(opt)_10     1             0.2495    6.03
    Pearlman_10    1             0.3058    5.1456

    RP_24          2             0.0581    12.36
    D_X(R)         2             0.0538    12.69
    TCQ_10         2             0.1194    9.23
    TCQ(opt)_10    2             0.0755    11.22
    LC(opt)_10     2             0.0668    11.75
    RP_24          3             0.0152    18.18
    D_X(R)         3             0.0134    18.73
    TCQ_11         3             0.0333    14.77
    TCQ(opt)_11    3             0.0201    16.96
    RP_24          4             0.0046    23.39
    D_X(R)         4             0.0033    24.79
    TCQ_12         4             0.0089    20.53

TABLE III
Laplacian Example

Fig. 1. Scatter plots of fake Gaussian 2-dimensional density: no permutation
and random permutation. (Panels: No Permutation; Random Permutation.)

Fig. 2. Performance: 1 bit Gaussian. (MSE versus shift register length L;
legend: Random Permutation Trellis Coder, Distortion Rate Function.)



For all three test sources (Gaussian, uniform, and Laplacian),
the performance of the random permutation trellis
source encoder approaches the Shannon limit. Therefore, it
is of interest to estimate the entropy rate of the encoder output
bit sequence, which should be close to that of an IID equiprobable
Bernoulli process since an entropy rate near 1 is a necessary
condition for approximate optimality [12], [14]. A "plug-in" (or
maximum-likelihood) estimator was used for this purpose.
The estimator uses the empirical probability of all words
of a fixed length in the sequence to estimate the entropy
rate. Bit sequences of length $10^6$ produced by encoding the
Gaussian, uniform, and Laplacian sources with a trellis encoder
of shift register length $L = 12$ were fed into the estimator;
the resulting entropy rate estimates range from 0.9993 to
0.9995. For comparison, the estimator yielded an entropy rate of
0.9998 for a randomly generated bit sequence of the same
length.
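A plug-in estimator of this kind can be sketched as follows. The word length `k`, the use of overlapping words, and the function name are assumptions made for illustration; the text does not state which word length or windowing was used.

```python
import math
import random
from collections import Counter

def plugin_entropy_rate(bits, k):
    """Empirical entropy of overlapping k-bit words, in bits per symbol."""
    counts = Counter(tuple(bits[i:i + k]) for i in range(len(bits) - k + 1))
    total = sum(counts.values())
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return h / k

rng = random.Random(0)
coin = [rng.getrandbits(1) for _ in range(100_000)]
est_random = plugin_entropy_rate(coin, k=8)
est_constant = plugin_entropy_rate([0] * 1000, k=8)
print(f"fair coin: {est_random:.4f}, constant sequence: {est_constant:.4f}")
```

The plug-in estimate is biased slightly downward for finite sequences, which is consistent with the reference value quoted above for a random bit sequence falling just short of 1.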

Eriksson et al.'s LC results for 1 bit at 512 states (equivalent
to shift register length 10) for the Gaussian source are better than
the random permutation results for the same shift register
length. This is likely the result of their exhaustive search over
all possible ways of labeling the branches within the constraint
of their axioms. A similar approach to the random permutation
code would be to search for the permutation that produces the
best results. Our results are from randomly chosen permutations,
so they reflect the performance of the ensemble average
(which we believe may eventually lead to a source coding
theorem using random coding ideas). All permutations have
the same marginals, but some permutations will have better
higher order distributions. Such an optimization is feasible
only for small $L$. We tested an optimization by exhaustion
for $L = 3$ and found that the best MSE (SNR) was 0.3262
(4.8647), while the average MSE (SNR) over all permutations
was 0.3852 (4.1431). This demonstrates that the best permutation
can provide notable improvement over the average, but
we have no efficient search algorithm for finding optimum
permutations.

Appendix A 
Proof of Lemma 2

The encoded and decoded processes are both stationary and
ergodic since the original source is. From (7) and the source
coding theorem,

$$D(f_n, g_n) = E[d(X_0, \hat{X}_0^{(n)})] \ge \bar{d}(\mu_X, \mu_{\hat{X}^{(n)}}) \ge \inf_{\nu : H(\nu) \le R} \bar{d}(\mu_X, \nu) = D_X(R).$$

The second inequality follows since stationary coding reduces
entropy rate, and so $R \ge H(U^{(n)}) \ge H(\hat{X}^{(n)})$. Since the
leftmost term converges to the rightmost, the first equality of
the lemma is proved.

Standard inequalities of information theory yield

$$R \ge H(U^{(n)}) \ge H(\hat{X}^{(n)}) \ge I(X, \hat{X}^{(n)}) \ge R_X(D(f_n, g_n)),$$

where the third inequality follows since mutual information
rate is bounded above by entropy rate, and the fourth inequality
follows from the process definition of the Shannon rate-distortion
function [11]. Taking the limit as $n \to \infty$, the
rightmost term converges to $R$ since the code sequence is
asymptotically optimal (so that $D(f_n, g_n) \to D_X(R) > 0$)
and the Shannon rate-distortion function is a continuous
function of its argument (except possibly at $D = 0$). Thus
$\lim_{n\to\infty} H(U^{(n)}) = \lim_{n\to\infty} H(\hat{X}^{(n)}) = R$, proving the
second equality of the lemma.

The final part requires Marton's inequality [19] relating
Ornstein's $\bar{d}$ distance and relative entropy when one of the
processes is IID. Suppose that $\mu_U$ and $\mu_Z$ are stationary
process distributions for two processes with a common discrete
alphabet and that $\mu_{U^N}$ and $\mu_{Z^N}$ denote the finite dimensional
distributions. For any integer $N$ the relative entropy or informational
divergence is defined by

$$H(\mu_{U^N} \| \mu_{Z^N}) = \sum_{u^N} \mu_{U^N}(u^N) \log \frac{\mu_{U^N}(u^N)}{\mu_{Z^N}(u^N)}.$$

In our notation Marton's inequality states that if $U$ is a
stationary ergodic process and $Z$ is an IID process, then

$$N^{-1} \bar{d}_N(\mu_{U^N}, \mu_{Z^N}) \le \left[ \frac{\ln 2}{2N} H(\mu_{U^N} \| \mu_{Z^N}) \right]^{1/2}.$$

Since $Z$ is an IID equiprobable process with alphabet size $2^R$,

$$N^{-1} \bar{d}_N(\mu_{U^N}, \mu_{Z^N}) \le \left[ \frac{\ln 2}{2N} \bigl(NR - H(U^N)\bigr) \right]^{1/2}$$

and taking the limit as $N \to \infty$ yields (in view of the properties
of the $\bar{d}$ distance)

$$\bar{d}(\mu_U, \mu_Z) \le \left[ \frac{\ln 2}{2} \bigl(R - H(U)\bigr) \right]^{1/2}.$$

Applying this to $U^{(n)}$ and taking the limit using the previous
part of the lemma completes the proof. □
Lemma 6: Let $\mu^N$ denote the $N$-fold product of a probability
distribution $\mu$ on the real line such that $\int x^2 \, d\mu(x) < \infty$.
Assume $\{\nu_n\}$ is a sequence of probability distributions on $\mathbb{R}^N$
such that $\lim_{n\to\infty} T_2(\mu^N, \nu_n) = 0$. If $Y_1^{(n)}, Y_2^{(n)}, \ldots, Y_N^{(n)}$
are random variables with joint distribution $\nu_n$, then for all
$i \ne j$,

$$\lim_{n\to\infty} E\bigl[(Y_i^{(n)} - E(Y_i^{(n)}))(Y_j^{(n)} - E(Y_j^{(n)}))\bigr] = 0.$$

Proof. The convergence of $\nu_n$ to $\mu^N$ in $T_2$ distance implies
that there exist IID random variables $Y_1, \ldots, Y_N$ with common
distribution $\mu$ and a sequence of $N$ random variables
$Y_1^{(n)}, Y_2^{(n)}, \ldots, Y_N^{(n)}$ with joint distribution $\nu_n$, all defined on
the same probability space, such that

$$\lim_{n\to\infty} E\bigl[(Y_i^{(n)} - Y_i)^2\bigr] = 0, \quad i = 1, \ldots, N. \tag{36}$$

First note that this implies for all $i$

$$\lim_{n\to\infty} E\bigl[(Y_i^{(n)})^2\bigr] = E[Y_i^2]. \tag{37}$$

Also, $\lim_{n\to\infty} E|Y_i^{(n)} - Y_i| = 0$ (Cauchy-Schwarz), so that for
all $i$,

$$\lim_{n\to\infty} E(Y_i^{(n)}) = E(Y_i). \tag{38}$$

Now the statement is a direct consequence of the fact that in any
inner product space, the inner product is jointly continuous.
To be more concrete, letting $\langle X, Y \rangle = E(XY)$ and $\|X\| =
[E(X^2)]^{1/2}$ for random variables $X$ and $Y$ with finite second
moment defined on this probability space, we have the bound

$$\bigl|\langle Y_i^{(n)}, Y_j^{(n)} \rangle - \langle Y_i, Y_j \rangle\bigr| \le \bigl|\langle Y_i^{(n)}, Y_j^{(n)} - Y_j \rangle\bigr| + \bigl|\langle Y_i^{(n)} - Y_i, Y_j \rangle\bigr| \le \|Y_i^{(n)}\| \, \|Y_j^{(n)} - Y_j\| + \|Y_i^{(n)} - Y_i\| \, \|Y_j\|.$$

Since $\|Y_i^{(n)}\|$ converges to $\|Y_i\|$ by (37) and $\|Y_i^{(n)} - Y_i\|$
converges to zero for every $i$ by (36), we obtain that
$\langle Y_i^{(n)}, Y_j^{(n)} \rangle$ converges to $\langle Y_i, Y_j \rangle$, i.e.,

$\lim_{n\to\infty} E(Y_i^{(n)} Y_j^{(n)})$